This notebook is based on the conference paper available here: http://www.ibpsa.org/proceedings/BS2019/BS2019_210725.pdf
Paper Abstract: This paper demonstrates a clustering method for grouping PVs of arbitrary orientation affected by nonuniform local shading. For a project with 44,000 PV panels cladding doubly curved roof surfaces, every panel has a unique orientation. In addition, clerestory windows cause non-uniform near-field shading. The combination of curvature and shading causes every panel to receive a different amount of irradiance at any time. The PV panel with the lowest output (lowest irradiance) in a MPPT string limits the output of the whole string. The method for grouping 44,000 PV panels into 296 MPPT strings affects the total annual energy production. The PV array grouped with clustering generated 7.7% more energy annually compared to the same array grouped in a simple grid.
The clustering method from the paper (which used Octave) has been re-coded in Python with scikit-learn in this notebook.
Here we explore the paper's methods (plus a few additional ones) using roof bay 5 as an example. Roof bay 5 contains 1175 PV panels and has a shadow-casting clerestory at its upper left edge.

%matplotlib notebook
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import time
The first file to load contains the position and normal vector for each PV panel. These positions represent the location of the panels in real-world space, with +Y as north, +X as east, and +Z as up.
PV_geometry = pd.read_csv('bay_05_wnormals.txt',
                          index_col=False,
                          sep='\t',
                          names=['x','y','z','xdir','ydir','zdir'])
PV_geometry.head()
The next file to load contains irradiance on the panels at hourly increments on sunny days. The days are chosen from a TMY weather file, and the panel irradiance was simulated using the diffuse and direct irradiance and the position of the sun at each increment in the weather file. There are 73 daylight hours across the seven days. These are the values used to cluster the PV panels.
The days selected are: 1/21, 2/25, 3/24, 4/21, 5/21, 6/27, and 12/23.
The dataframe has 73 columns (one for each daylight hour on the seven days) and 1175 rows (one for each PV panel on bay 5).
sunny_days = pd.read_csv('sky/sunnydays.wea',
                         index_col=False,
                         skiprows=6,
                         header=None,
                         sep=' ',
                         names=['month','day','hour','DNI','DHI'])
sunny_days['h'] = np.floor(sunny_days['hour'])
sunny_days['m'] = 60 * (sunny_days['hour'] - sunny_days['h'])
sunny_days['timestamp'] = sunny_days.apply(
    lambda x: f"{x['month']:0.0f}/{x['day']:0.0f} {x['h']:0.0f}:{x['m']:0.0f}",
    axis=1)
PV_irrad_sunny_days = pd.read_csv('bay_05_irrad_sunnydays.txt',
                                  index_col=False,
                                  skiprows=7,
                                  header=None,
                                  sep='\t',
                                  names=sunny_days['timestamp'].tolist())
PV_irrad_sunny_days.head(10)
The final file to load contains irradiance on each PV panel for all hours in the TMY weather file. The dataframe has 8760 columns (one for each hour of the year) and 1175 rows (one for each PV panel).
all_days = pd.read_csv('sky/Moffett.wea',
                       index_col=False,
                       skiprows=6,
                       header=None,
                       sep=' ',
                       names=['month','day','hour','DNI','DHI'])
all_days['h'] = np.floor(all_days['hour'])
all_days['m'] = 60 * (all_days['hour'] - all_days['h'])
all_days['timestamp'] = all_days.apply(
    lambda x: f"{x['month']:0.0f}/{x['day']:0.0f} {x['h']:0.0f}:{x['m']:0.0f}",
    axis=1)
PV_irrad_all_days = pd.read_csv('bay_05_irrad_alldays.txt',
                                index_col=False,
                                skiprows=7,
                                header=None,
                                sep='\t',
                                names=all_days['timestamp'].tolist())
PV_irrad_all_days.head()
## this is a helper function to plot PV panels
def PV_scatter(position_x, position_y, color, cmap='rainbow', vmin=0, vmax=1000, size=3, title=''):
    fig = plt.figure(figsize=(6,6))
    plt.scatter(position_x, position_y, c=color,
                s=size, cmap=cmap, vmin=vmin, vmax=vmax)
    plt.gca().set_title(title)
    plt.gca().spines['left'].set_visible(False)
    plt.gca().spines['top'].set_visible(False)
    plt.gca().spines['right'].set_visible(False)
    plt.gca().spines['bottom'].set_visible(False)
    plt.gca().xaxis.set_visible(False)
    plt.gca().yaxis.set_visible(False)
irrad_max = PV_irrad_sunny_days.max().max()
column_to_plot = 3
PV_scatter(PV_geometry['x'], PV_geometry['y'],
           color=PV_irrad_sunny_days.iloc[:,column_to_plot],
           size=20,
           vmax=irrad_max,
           title='PV panel irradiance values (W/m2)\n for ' + PV_irrad_sunny_days.columns[column_to_plot])
Plotting hourly irradiance maps for a sunny day in May.
irrad_max = PV_irrad_sunny_days.max().max()
irrad_plot, axs = plt.subplots(3, 4, figsize=(10,7.5))
irrad_plot.suptitle('PV irradiance for different times of day on May 21')
first_chart = 38
for i, ax in enumerate(axs.ravel()):
    ax.scatter(PV_geometry['x'], PV_geometry['y'], c=PV_irrad_sunny_days.iloc[:,i+first_chart],
               s=2, cmap='rainbow', vmin=0, vmax=irrad_max)
    ax.spines['left'].set_visible(False)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.xaxis.set_visible(False)
    ax.yaxis.set_visible(False)
    ax.set_title(PV_irrad_sunny_days.columns[i+first_chart])
Reducing from 73 irradiance dimensions to 2 dimensions for plotting.
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca_irrad_sunny_days = pca.fit_transform(PV_irrad_sunny_days)
print("explained variance ratio: ",pca.explained_variance_ratio_)
The explained variance ratio describes how much of the total variance in the data is captured by each dimension of the PCA. In this example, the first dimension (X axis) captures 68.8% of the total variance in the 73-dimensional data, and the second dimension (Y axis) captures 15.1%. So with this PCA transform we can illustrate about 84% of the variance in the original data in just 2 dimensions.
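As a quick check, the fraction of variance visible in a 2D plot can be summed directly from `explained_variance_ratio_` (a minimal sketch on synthetic data, since the notebook's irradiance files aren't reproduced here):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the 1175 x 73 irradiance matrix used in the notebook
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))

pca = PCA(n_components=3)
pca.fit(X)

# explained_variance_ratio_ is sorted largest-first; the first two entries
# give the share of total variance visible in a 2D scatter plot
two_dim_variance = pca.explained_variance_ratio_[:2].sum()
print(f"variance captured in 2D: {two_dim_variance:.1%}")
```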
And now to plot the PV panels in the PCA transform.
PV_scatter(pca_irrad_sunny_days[:,0], pca_irrad_sunny_days[:,1],
           color=[(1,0,0)], size=20, title='2D PCA plot of PV panels')
irrad_max = PV_irrad_sunny_days.max().max()
irrad_plot, axs = plt.subplots(3, 4, figsize=(10,7.5))
irrad_plot.suptitle('PV irradiance for different times of day, plotted on PCA axes')
first_chart = 39
for i, ax in enumerate(axs.ravel()):
    ax.scatter(pca_irrad_sunny_days[:,0], pca_irrad_sunny_days[:,1], c=PV_irrad_sunny_days.iloc[:,i+first_chart],
               s=2, cmap='rainbow', vmin=0, vmax=irrad_max)
    ax.spines['left'].set_visible(False)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.xaxis.set_visible(False)
    ax.yaxis.set_visible(False)
    ax.set_title(PV_irrad_sunny_days.columns[i+first_chart])
Plotting side-by-side irradiance maps in physical location and PCA coordinates.
## this is a helper function to plot irradiance on real world location alongside PCA coordinates
def PV_double_scatter(dot_color, cmap='rainbow', vmin=0, vmax=1000, size=12, title=''):
    fig, axs = plt.subplots(1, 2, figsize=(10,5))
    for ax in axs.ravel():
        ax.spines['left'].set_visible(False)
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        ax.spines['bottom'].set_visible(False)
        ax.xaxis.set_visible(False)
        ax.yaxis.set_visible(False)
    axs[0].scatter(PV_geometry['x'], PV_geometry['y'], c=dot_color,
                   s=size, cmap=cmap, vmin=vmin, vmax=vmax)
    axs[0].set_title('physical location')
    axs[1].scatter(pca_irrad_sunny_days[:,0], pca_irrad_sunny_days[:,1], c=dot_color,
                   s=size, cmap=cmap, vmin=vmin, vmax=vmax)
    axs[1].set_title('PCA coordinates')
    fig.suptitle(title)
column_to_plot = 46
PV_double_scatter(PV_irrad_sunny_days.iloc[:,column_to_plot],
                  title='Irradiance on ' + PV_irrad_sunny_days.columns[column_to_plot])
from sklearn.cluster import KMeans
Each roof bay has 8 MPPTs, each of which can accommodate between 35 and 220 PV panels. We want our clustering algorithm to create eight clusters.
n_clust = 8
We feed the clustering algorithm the dataframe containing panel irradiance for the 73 daylight hours on seven sunny days. The KMeans algorithm creates groups that minimize the Euclidean distance between each panel's irradiance vector and the centroid of its group. The result is eight groups of panels, where each panel's irradiance is closer to the center of its own group than to the center of any other group. The grouped PV panels therefore have similar irradiance profiles to each other.
%%time
clust_km = KMeans(n_clusters=n_clust)
clust_km.fit(PV_irrad_sunny_days)
PV_double_scatter(clust_km.labels_, cmap='tab10', vmax=7,
                  title='Clusters Determined by KMeans')
The KMeans algorithm created groups in less than a second. However, there is no way to constrain the sizes of the groups with KMeans (or any other common clustering algorithm), and several of the created groups violate the size constraints of the MPPTs. For example, the group along the upper left edge has fewer than 35 panels, and the two groups in the center have more than 220 panels.
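A quick way to spot the violating groups is to tally the label counts against the MPPT limits (a sketch using made-up labels; in the notebook this check would run on `clust_km.labels_`):

```python
import numpy as np

cluster_minimum, cluster_maximum = 35, 220

# Made-up cluster assignments standing in for clust_km.labels_
rng = np.random.default_rng(1)
labels = rng.integers(0, 8, size=1175)
labels[labels == 0] = 1  # empty out cluster 0 so the minimum constraint fails

sizes = np.bincount(labels, minlength=8)
violations = [(i, int(n)) for i, n in enumerate(sizes)
              if n < cluster_minimum or n > cluster_maximum]
print("cluster sizes:", sizes)
print("violating clusters (id, size):", violations)
```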
An iterative reassignment process was used to move PVs to other groups. The process is as follows:
1. For any cluster larger than the maximum, set the distances of its most distant members to infinity so they are assigned elsewhere (shed_some).
2. For any cluster smaller than the minimum, set the distances of its nearest non-member panels to zero so they are captured (gather_more).
3. Reassign every panel to its closest centroid using the modified distance matrix.
4. Recalculate the cluster centroids from the new assignments, recalculate the (unmodified) distance matrix, and reassign each panel to its closest centroid once more.
The last step moves back towards the original KMeans result: we want a solution where panels end up in a preferred group, but we don't recompute centroids and reassign repeatedly, because that eventually returns us to the KMeans result.
We also use an iteratively incremented value that causes the process to reassign one more panel in each iteration. This prevents the process from getting stuck at an equilibrium that repeats the same result without meeting the constraints.
cluster_minimum = 35
cluster_maximum = 220
The following functions are used to perform the steps in the iterative reassignment process.
def meets_size_constraints(cluster_sizes, min_size, max_size):
    '''checks each cluster to ensure it meets minimum and maximum size constraints.
    Returns True if all clusters meet constraints, otherwise False'''
    meets_constraints = [ min_size <= a[0] <= max_size for a in cluster_sizes ]
    return all(meets_constraints)
def calc_cluster_distances(datapoints, centers):
    '''calculates Euclidean distances from data points to all cluster centers.
    datapoints is an MxD array
    centers is an NxD array
    returns an MxN array of distances from points (M) to centers (N)
    vectorized implementation borrowed from:
    https://medium.com/@souravdey/l2-distance-matrix-vectorization-trick-26aa3247ac6c'''
    squared_distances = (-2 * np.dot(datapoints, centers.T)
                         + np.sum(centers**2, axis=1)
                         + np.sum(datapoints**2, axis=1)[:, np.newaxis])
    distances = squared_distances ** 0.5
    return distances
def gather_more(distances, cluster_index, new_total):
    '''sets the distance of the new_total closest points to a centroid to zero so that those
    points are assigned to that centroid, increasing the number of points in the cluster to new_total'''
    ordered = sorted(distances[:,cluster_index])
    cutoff = ordered[new_total]
    distances[ distances[:,cluster_index] < cutoff, cluster_index ] = 0
    return distances
def shed_some(distances, cluster_index, new_total):
    '''sets the distance of points beyond the new_total closest points to infinity
    so that those points are assigned to a different cluster, limiting the number of points
    in the selected cluster to new_total'''
    ordered = sorted(distances[:,cluster_index])
    cutoff = ordered[new_total]
    distances[ distances[:,cluster_index] >= cutoff, cluster_index ] = np.inf
    return distances
def assign_to_closest(distances):
    '''takes the index of the shortest distance for each data point and assigns the data point to that cluster'''
    cluster_id = np.array([ d.argmin() for d in distances ])
    return cluster_id
def calc_cluster_centers(datapoints, num_clusters, cluster_id):
    '''calculates new cluster centroids by averaging the datapoint coordinates assigned to each cluster'''
    centroids = [[ np.mean(d) for d in datapoints[ cluster_id == cls_id ].T ] for cls_id in range(num_clusters) ]
    return np.array(centroids)
def calc_inertia(distances, cluster_id):
    '''calculates the inertia (summed squared distance of each datapoint to its assigned cluster)'''
    square_distances = [ dist[cls_id]**2 for dist, cls_id in zip(distances, cluster_id) ]
    inertia = np.sum(square_distances)
    return inertia
def calc_cluster_sizes(cluster_ids, num_clusters):
    '''counts the members of each cluster; returns an array of [size, cluster_index] pairs'''
    cluster_sizes = np.array([ [sum(cluster_ids == i), i] for i in range(num_clusters) ])
    return cluster_sizes
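As a sanity check, the vectorized distance trick in `calc_cluster_distances` can be compared against a naive double loop (a small self-contained sketch with random data):

```python
import numpy as np

rng = np.random.default_rng(2)
datapoints = rng.normal(size=(6, 4))  # M=6 points in D=4 dimensions
centers = rng.normal(size=(3, 4))     # N=3 centers in the same space

# Vectorized squared distances: ||p||^2 + ||c||^2 - 2 p.c for every (p, c) pair
squared = (-2 * np.dot(datapoints, centers.T)
           + np.sum(centers**2, axis=1)
           + np.sum(datapoints**2, axis=1)[:, np.newaxis])
vectorized = np.sqrt(np.clip(squared, 0, None))  # clip guards tiny negative rounding errors

# Naive reference: one norm per point/center pair
naive = np.array([[np.linalg.norm(p - c) for c in centers] for p in datapoints])

print("max difference:", np.abs(vectorized - naive).max())
```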
#assign the dataframe to a numpy array to use vectorization
clustering_datapoints = PV_irrad_sunny_days.to_numpy()
#initialize the clusters with the KMeans result
cluster_ids = clust_km.labels_
cluster_centroids = clust_km.cluster_centers_
#calculate a distance matrix
cluster_distances = calc_cluster_distances(clustering_datapoints, cluster_centroids)
#calculate the initial cluster sizes
cluster_sizes = calc_cluster_sizes(cluster_ids, n_clust)
#calculate the initial inertia
inertia = calc_inertia(cluster_distances, cluster_ids)
#print sizes and initial inertia
print(cluster_sizes[:,0], inertia)
%%time
#This cell performs the iterative reassignment.
#itr keeps track of how many iterations we've done.
itr=0
#while constraints are not met (and itr is less than the cluster maximum)
while not meets_size_constraints(cluster_sizes, cluster_minimum, cluster_maximum) and itr < cluster_maximum:
    for clust in cluster_sizes:
        if clust[0] > cluster_maximum:
            #if cluster exceeds max size, reduce size of cluster to (maximum - itr)
            cluster_distances = shed_some(cluster_distances, clust[1], cluster_maximum-itr)
        elif clust[0] < cluster_minimum:
            #if cluster is less than min size, increase size of cluster to (minimum + itr)
            cluster_distances = gather_more(cluster_distances, clust[1], cluster_minimum+itr)
    #modifying the max and min sizes with itr slowly increases the severity of the
    #cluster reassignment, avoiding getting stuck in equilibrium.
    cluster_ids = assign_to_closest(cluster_distances)
    cluster_centroids = calc_cluster_centers(clustering_datapoints, n_clust, cluster_ids)
    cluster_distances = calc_cluster_distances(clustering_datapoints, cluster_centroids)
    pre_inertia = calc_inertia(cluster_distances, cluster_ids)
    cluster_ids = assign_to_closest(cluster_distances)
    cluster_sizes = calc_cluster_sizes(cluster_ids, n_clust)
    inertia = calc_inertia(cluster_distances, cluster_ids)
    print(itr, cluster_sizes[:,0], f'{pre_inertia:0.1f}', f'{inertia:0.1f}')
    itr += 1
PV_double_scatter(cluster_ids, cmap='tab10', vmax=7,
                  title='Clusters Determined by KMeans with iterative reassignment')
from sklearn.cluster import DBSCAN
Someone suggested exploring DBSCAN as a method to cluster based on irradiance profile. DBSCAN finds clusters of higher density separated by regions of lower density. The plot below shows irradiance at two times (9:30 and 15:30) on May 21. You can see two distinct clusters of values; the group with low y-values results from a shadow on the roof at 15:30. DBSCAN easily separates these into two clusters, so you can see how DBSCAN could potentially be a good method for clustering based on shadow pattern.
However, when you look at the PCA diagram above, there is one region of high density and one region of low density. The shadow moves across the roof slowly hour by hour, and taken in totality there aren't multiple regions of high density. The result with DBSCAN on the 73-dimensional dataset is a single large cluster, with two small clusters along the top left edge.
Another problem with DBSCAN for this application is the inability to pre-determine the number of clusters. The system design has 8 electrical groups, so we want exactly that number of clusters.
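That limitation is easy to demonstrate: the number of clusters DBSCAN returns is governed by `eps`, not requested up front (a sketch on two well-separated 1-D groups; the `eps` values are arbitrary):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense 1-D groups of points separated by a large gap
X = np.concatenate([np.arange(0, 5, 0.1), np.arange(50, 55, 0.1)]).reshape(-1, 1)

counts = {}
for eps in (0.25, 100.0):
    labels = DBSCAN(eps=eps, min_samples=3).fit(X).labels_
    counts[eps] = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
print(counts)  # a small eps separates the groups; a huge eps merges everything
```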
PV_scatter(PV_irrad_sunny_days.iloc[:,42], PV_irrad_sunny_days.iloc[:,48],
           color='red',
           size=20,
           vmax=irrad_max,
           title='Irradiance for:\n'+PV_irrad_sunny_days.columns[42]+' vs. '+PV_irrad_sunny_days.columns[48])
clust_db2d = DBSCAN(eps=20, min_samples=1)
clust_db2d.fit(np.array([PV_irrad_sunny_days.iloc[:,42], PV_irrad_sunny_days.iloc[:,48]]).T)
PV_scatter(PV_irrad_sunny_days.iloc[:,42], PV_irrad_sunny_days.iloc[:,48],
           color=clust_db2d.labels_, cmap='tab10', vmax=clust_db2d.labels_.max(),
           title='Clusters Determined by DBSCAN')
PV_double_scatter(clust_db2d.labels_, cmap='tab10', vmax=clust_db2d.labels_.max(), title='Clusters Determined by DBSCAN')
clust_db = DBSCAN(eps=300, leaf_size=30, min_samples=5)
clust_db.fit(PV_irrad_sunny_days)
PV_double_scatter(clust_db.labels_, cmap='tab10', vmax=7, title='Clusters Determined by DBSCAN')
The other way to potentially create groups is through optimization. The benefit of optimization is that you can include group size constraints as a penalty in the fitness function.
Below are two half-hearted attempts at generating groups with a differential evolution optimizer. I could probably get a better result by improving the fitness function, by writing a discrete optimizer, or by using a package with one. But based on these initial results, I don't think it's worth the effort: these optimizations take hours to produce a poor result.
First, trying a genetic algorithm using a continuous variable between 0 and 8 for each panel. The group is assigned by flooring the value and casting to an integer. Very much a kludge, but worth a try.
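The float-to-group mapping is just a floor and an integer cast; the upper bound of 7.9999 keeps the floored value at 7 (a minimal illustration):

```python
import numpy as np

# Continuous optimizer variables in [0, 8) map to integer group ids by flooring
floats = np.array([0.2, 3.5, 3.9999, 7.9999])
group_ids = np.floor(floats).astype(int)
print(group_ids)  # -> [0 3 3 7]
```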
from scipy.optimize import differential_evolution
cluster_ids = np.random.randint(0,n_clust,(1175))
bounds = [ (0,7.9999) for i in range(len(cluster_ids))]
clustering_datapoints = PV_irrad_sunny_days.to_numpy()
def cost_function(cluster_floats):
    '''cost function for optimization, returns inertia multiplied by a penalty'''
    global clustering_datapoints, n_clust, cluster_minimum, cluster_maximum
    cluster_ids = np.array(np.floor(cluster_floats), dtype=int)  # np.int is deprecated; use int
    cluster_centroids = calc_cluster_centers(clustering_datapoints, n_clust, cluster_ids)
    cluster_distances = calc_cluster_distances(clustering_datapoints, cluster_centroids)
    cluster_sizes = calc_cluster_sizes(cluster_ids, n_clust)
    inertia = calc_inertia(cluster_distances, cluster_ids)
    if not meets_size_constraints(cluster_sizes, cluster_minimum, cluster_maximum):
        inertia *= 2
    return inertia
%%time
ga_result = differential_evolution(cost_function, bounds, maxiter=1000, popsize=14, workers=7 )
ga_result
PV_double_scatter(np.floor(ga_result.x), cmap='tab10', vmax=7,
                  title='Clusters Determined by GA on group assignment')
Ha ha! Yeah, no. The fitness function didn't improve much beyond the random assignment.
This attempt instead tries to find cluster centroids that minimize the cost function. There are 8 × 73 = 584 values to optimize.
cl_cent_shape = np.shape(clust_km.cluster_centers_)
ga_centroids = clust_km.cluster_centers_.ravel()
bounds2 = np.array([ [ (lo, hi) for lo, hi in zip(PV_irrad_sunny_days.min(), PV_irrad_sunny_days.max()) ]
                     for i in range(cl_cent_shape[0]) ]).reshape(len(ga_centroids), 2)
def cost_function2(ga_centroids_flat):
    '''cost function for optimization, returns inertia multiplied by a penalty'''
    global clustering_datapoints, n_clust, cluster_minimum, cluster_maximum, cl_cent_shape
    ga_centroids = ga_centroids_flat.reshape(cl_cent_shape)
    ga_distances = calc_cluster_distances(clustering_datapoints, ga_centroids)
    ga_ids = assign_to_closest(ga_distances)
    ga_sizes = calc_cluster_sizes(ga_ids, n_clust)
    inertia = calc_inertia(ga_distances, ga_ids)
    if not meets_size_constraints(ga_sizes, cluster_minimum, cluster_maximum):
        inertia = inertia * 2
    return inertia
%%time
ga2_result = differential_evolution(cost_function2, bounds2, maxiter=100000, popsize=14, workers=7)
ga2_result
ga_distances = calc_cluster_distances(clustering_datapoints, ga2_result.x.reshape(cl_cent_shape))
ga_ids = assign_to_closest(ga_distances)
ga_sizes = calc_cluster_sizes(ga_ids, n_clust)
inertia = calc_inertia(ga_distances, ga_ids)  # use ga_ids, not the leftover cluster_ids
cost = cost_function2(ga2_result.x)
print(inertia, cost)
PV_double_scatter(ga_ids, cmap='tab10', vmax=7,
                  title='Clusters Determined by GA on cluster centroids')